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Abstract 



Factorial hidden Markov models (FHMMs) 
are powerful tools of modeling sequential 
data. Learning FHMMs yields a challeng¬ 
ing simultaneous model selection issue, i.e., 
selecting the number of multiple Markov 
chains and the dimensionality of each chain. 
Our main contribution is to address this 
model selection issue by extending Factorized 
Asymptotic Bayesian (FAB) inference to FH¬ 
MMs. First, we offer a better approximation 
of marginal log-likelihood than the previous 
FAB inference. Our key idea is to integrate 
out transition probabilities, yet still apply 
the Laplace approximation to emission prob¬ 
abilities. Second, we prove that if there are 
two very similar hidden states in an FHMM, 
i.e. one is redundant, then FAB will al¬ 
most surely shrink and eliminate one of them, 
making the model parsimonious. Experimen¬ 
tal results show that FAB for FHMMs sig¬ 
nificantly outperforms state-of-the-art non- 
parametric Bayesian iFHMM and Variational 
FHMM in model selection accuracy, with 
competitive held-out perplexity. 


1 Introduction 

The Factorial Hidden Markov Model (FHMM) [H] is 
an extension of the Hidden Markov Model (HMM), 
in which the hidden states are factorized into several 
independent Markov chains, and emissions (observa¬ 
tions) are determined by their combination. FHMMs 
have found successful applications in speech recogni¬ 
tion m, source separation [Him [Hill] , natural lan¬ 
guage processing ^ and bioinformatics m- A graph¬ 
ical model representation of the FHMM is shown in 
FigH] 


Figure 1: Graphical Model Representation of FHMMs. 

Learning FHMMs naturally yields a challenging simul¬ 
taneous model selection issue, i.e., how many inde¬ 
pendent Markov chains we need (layer-level model se¬ 
lection), and what is the dimensionality (number of 
hidden states) of each Markov chain. As the model 
space increases exponentially with the number of the 
Markov chains, it is not feasible to employ a grid search 
based method like cross-validation [2]. More sophis¬ 
ticated methods, like variational Bayesian inference 
(VB-FHMMs) [5 and the nonparametric Bayesian In¬ 
finite FHMMs (iFHMMs) [12] have been proposed, but 
they cannot fully address the issue. For example, iFH¬ 
MMs restrict the dimensionality of each Markov chain 
to be binary. Further, high computational costs of 
VB-FHMMs and iFHMMs restrict their applicability 
to large scale problems. 

This paper addresses the above model selection issue of 
FHMMs by extending Factorized Asymptotic Bayesian 
(FAB) inference, which is a recent model selection 
framework and has shown superior performance than 
nonparametric Bayesian inference for Mixture Models 
[7], Hidden Markov Models |5], Latent Feature Mod¬ 
els |2|, and Hierarchical Mixture of Experts Models 
[5|. Our method, namely FABfhmm, not only fully 
addresses the simultaneous model selection issue, but 
also offers the following two contributions. 

1) Better marginal log-likelihood approxima¬ 
tion with partial marginalization technique: 
Previous FAB inference for HMMs |5] has applied the 
Laplace approximations both to emission probabilities 
and to transition probabilities. Our approach applies 

















the Laplace approximations only to emission probabili¬ 
ties, and transition probabilities are integrated out. As 
we do not conduct asymptotic approximation but fol¬ 
low exact marginalization on transition probabilities, 
we can better approximate marginal log-likelihood. 

2) A quantitative analysis of the shrinkage pro¬ 
cess; One of the strong features in FAB inference 
is a shrinkage effect, caused by FAB-unique asymp¬ 
totic regularization, on hidden variables , i.e., redun¬ 
dant hidden states are automatically removed during 
the FAB EM-like iterative optimization. Although its 
strong model selection capability has been empirically 
confirmed Ellin], its mathematical behavior has not 
been well studied. This paper carefully investigates 
the shrinkage process, and proves that if there are two 
very similar hidden states in FHMMs, then one state 
would almost surely “die out”. We reveal the follow¬ 
ing chain reaction between E-steps and M-steps: if the 
variational probabilities of some states are shrunk in 
an E-step, then their corresponding model parameters 
will also be shrunk in the next M-step, causing these 
state to be shrunk further in future E-steps. More¬ 
over, under certain condition, this shrinkage process 
is accelerating. This finding also partially answers the 
parameter identifiability problem of FHMMs. 


den states may give us better understanding of the 
data. Second, the slice sampling used to optimize the 
iFHMM is considerably slower than variational meth¬ 
ods, and therefore iFHMMs do not scale well to large 
scale scenarios. 

2.2 FAB Inference 

Factorized asymptotic Bayesian (FAB) inference has 
been originally proposed to address model selection of 
mixture models E- FAB inference selects the model 
which maximizes an asymptotic approximation of the 
marginal log-likelihood of the observed data, referred 
to as Factorized Information Criterion (FIC), with an 
EM-like iterative optimization procedure. Previous 
studies have extended FAB inference to HMMs (se¬ 
quential) [6| and Latent Feature Models (factorial) [9] , 
and have shown superiority against variational infer¬ 
ence and nonparametric Bayesian methods in terms 
of model selection accuracy and computational cost. 
It is an interesting open challenge to investigate FAB 
inference for their intermediate (both sequential and 
factorial) models, i.e., FHMMs. 

3 Factorial Hidden Markov Models 


2 Related Work 

2.1 Factorial HMMs 

The pioneering work of FHMMs was by Z. Ghahra- 
mani and M. Jordan [5|. The idea of FHMMs is to 
express emissions (observations) by combining M in¬ 
dependent Markov chains. If the individual Markov 
chain have binary states, the FHMM can express 
2^ different emission distributions. Their inference, 
structured variational inference, uses Baum-Welch al¬ 
gorithm m to learn parameters. Generally speak¬ 
ing, variational Bayesian (VB) inference [Ml [1] can 
prune redundant latent states, but previous studies 
have suggested the pruning effect is not strong enough 
to achieve good model selection performance. 

Recently, infinite FHMMs (iFHMMs) have been pro¬ 
posed [19] to address the model selection issue of FH¬ 
MMs. iFHMMs employ the Markov Indian Buffet 
Process (mIBP) as the prior of the transition matrix 
of the infinite hidden states. Although iFHMMs of¬ 
fer strong model selection ability in learning FHMMs, 
they have a few limitations. First, iFHMMs restrict la¬ 
tent variables to be binary while the original FHMMs 
[5] have no such limitation. This restriction makes 
iFHMMs generate more represented states than vari¬ 
ational methods, tending to overfit the data. In con¬ 
trast, excluding such a restriction and allowing differ¬ 
ent Markov chains to have different numbers of hid- 


Suppose we have observed N independent sequences, 
denoted as = x^,--- ,x^. The n-th sequence 
is denoted as a;" = a:" , • • • , x'tf , where r„ is the 
length of the n-th sequence. Respectively, we de¬ 
note the corresponding sequences of hidden state vari¬ 
ables as = z^, - ■■ ,z^, and each sequence z'^ = 
zf,Z 2 ,--- , zJf^. z^ consists of hidden variables of M 


Zf = 2; 


",( 1 ) n,{2) 


n,(M) 


independent HMMs: 

The m-th HMM has hidden states, i.e., 2 ;"’*'’”^ G 
{0,1}^"*. We represent as a ■ 1 binary 

vector, where the A:-th component, denotes 


whether state 


= k. 


An FHMM model is specified as the following proba¬ 
bility density: 

M T„ 

m—1 t—2 

T 

Y[p{Xt\Zt,(t>), (I) 

t=l 


where 9 = (a,f},(p). The initial, transition and 
emission distributions are represented as ja™), 

andp(a;”| 2 ;”,^), respectively. 

The m-th initial and transition distributions are de¬ 
fined as follows: 




(m). 
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Here a 


(m) 


= (c 


-k1) 


with 


= 1 ; 


(/3|™^) , with each row summing to 1 . 


By following the original FHMM work [S], this paper 
considers multidimensional Gaussian emission, while it 
is not difficult to extend our discussion to more general 
distributions like one in the exponential family. The 
emission distribution is jointly parameterized across 
the hidden states in the M HMMs as follows: 


( 2 ) 

where /x" and C are the mean vector and covariance 
matrix. A key idea (and difference from HMMs) in 
FHMMs is the construction of the mean vector /x". 
More specifically, the mean vector /i," is represented 
by the following linear combination: 

M 

/xr = E (3) 

m—l 

where W™ is a Z? x Km matrix. The fc-th row of VF™ 
is denoted as specifying the contribution of the 

fc-th state in the m-th HMM to the mean. Here, (p in 
Q can be represented as cp = {W,C). 


4 Refined Factorized Information 
Criterion for FHMMs 


FIG derivation starts from the following equivalent 
form of marginal log-likelihood: 

\ogp{x^\M) ^ maxE (4) 

q ^ q{z^) 

M represents a model and M = {M, Ki, - ■ ■ , Km) in 
FHMMs. q{z) is a variational distribution over the 
hidden states. For a state vector 2 "’'", its expecta¬ 
tion under g. Eg [ 2 :”’™], is denoted in shorthand as 
5 ( 2 "’'"), and the expectation of two consecutive states 

p \n,m ^ ri.mi 

^t,k 1 S'® )■ 

A direct application of the technique proposed in BE] 
leads to the following asymptotic approximation of the 
complete marginal log-likelihood: 

M 

p{x^ ,z^\M) = 

m—l 

^ Laplace Approx. 

f n’ (3”^)p{f3^\M)dr } 

t^2 

■V 

^ Laplace Approx. 

\\p{x7\z^,<t^)p{^\M)d<p. (5) 

Laplace Approx. 




Instead, this paper proposes the FIG derivation with 
integrating out the initial and transition distribu¬ 
tions. 


M „ 

p{x^,z^\M)= n{y 


m—l s. 


Y[p{^ 




Integrated Out 

i^r\pnp{(3nM)dp^} 


Integrated Out 

. T 

Y[p{xt\zt,cp)p{cp\M)d(p. 


( 6 ) 


Laplace Approx. 


The original motivation of FIG is to approximate “in¬ 
tractable” Bayesian marginal log-likelihood. The key 
idea in (|^ is to refine approximation by analytically 
solving the tractable integrations (w.r.t. a™ and /3™) 
and minimizing the asymptotic approximation error. 

After integrating out all the parameters, and normal¬ 
izing the Hessians w.r.t. W and C~^, we obtain: 

With uninformative conjugate priors on a"* and /3”^, 
we can calculate ®> as follows: 


p{x^, 2^|A4) 

, -pr / life h(Cni,0,fc + 1) TT rifc h(Cm,j,fc + 1) \ 

m Cm, 0 ,fe + Aim) ^ -f ZF^) ^ 


D 


TT,„ , iio±i / TT- T„ 1 - ^ . 

ll(27r) ^ (11 y) 

d —1 n —1 

M,K^ N,T^ 

n <i; 

m,k—l n,t—l 


(7) 


where Kq = Km, and - is the normalized 
Hessian w.r.t. the d-th rows of W and C~^ at the 
maximum likelihood estimators W, C. The normal¬ 
ization makes \^WdCd\ ~ by extracting the 

terms involving 2 . 

Q computes the expectation of logp(a;-^, 2 -^|Af) 
w.r.t. the variational distribution q. Taking the log¬ 
arithm of 0 . many log-Gamma terms appear, whose 
exact expectations require to enumerate the exponen¬ 
tially many possible configurations of 2 , which is infea¬ 
sible. Thus we propose a first-order approximation to 
them, based on the following lemmas, where ei,e 2 ,e 3 
are small bounded errors. 

Lemma 1. Suppose {z”,--- , 2 ^ are N se¬ 

quences of Bernoulli random variables. Within each 
sequence, are independent with each other. 

Let Un = ■: Vn = ElVn]- Besides, there are N 

















numbers {y-n}, '^k,yn ~ j/n- When all Tn are large 
enough: 

i;E[logr(y„)] = y„logy„-(y„ + ilogy„)+log27r+ei. 

2) E [E„logr(y„)-logr(E„ y„)] = E„ ynlog( ^»^" 

+ ^ log ( E„ yn) - 5 En logyn + {N - 1) log 27r + € 2 . 

3) E[logj/„] = logy„ + |(?/n - Vn) + £3. 


Proof can be found in the Appendix. 

A j £! j i_ —1 n,m n,m o* 

As dehned above, Cm, 3 ,k = 

follows a Markov process, the dependency 
between z"j"* and j vanishes quickly when the 

time gap At increases, leading to the uncoupling be- 
tween and Zt+At,jh+i+At,k- Thus we can 

approximately regard Cm,j,k as the sum of independent 
Bernoulli random variables, and therefore Lemma 1 
apply. The same argument applies to Cm,o,k- 


In order to avoid recursion relations w.r.t. q dur¬ 
ing the inference, we introduce an auxiliary dis¬ 
tribution q, which is always close to q, in that 

The Gamma terms in ([7| will be expanded about the 


expectations of their parameters w.r.t. q. We will dis¬ 
cuss how to choose q in Section [5^ 


Taking the logarithm of Q, applying Lemma 1, and 
dropping asymptotically small terms (including the de¬ 
terminants of the normalized Hessian matrices), we 
derive the approximation of p{x^,z^\M). Plugging 
it into Q, we obtain an asymptotic approximation of 
the lower bound of logp(a;'^|M): 


\ogp{x^\M)> J{q,x^) 

' E f E ( w,c)+y^ z^^ log si 


n,t 


m,k 


M,N 


Tn-1 


E l \ '' n.m 1 '-m , \ '' \ '' n.m n.m i Am 

(E^o.fc iog«fc + E E^tj 


m,n—l k 


- \ogq{z^) 


t=l j,k 


E^logA™-— El log(Cm...fc) 


m.,n,t 


,,k—l 


cy ^ ^ “b 1 ) cy ^ ^ lo§(Cmj,fc H“ 1 ) 


m,k 


m,j,k 


4 E log(E + Km) + ^ E ^°g(^ + 
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— E + E - 1) log 27r -b e, 


with the definitions 


•'/ n,m\ j~i ( n,m\ 

). 

/ n.m n.m \ m / n.m n.m \ 

^h+l,k) = Kq[Z^j ,Zt+i_fc) 

N 

Ar,Trr 

- \ '' n.m\ 

Cm,0,fc — 2_^ ) 

^ \ '' n.m\ 

, Cm^-^k = 2^ q{z^l ), 

n—1 

n,i—1 

N,Tr^-l 


Sm,j,k = E! 9(^t, 

.,m n,m \ 


n,t—l 


xm 

^k 


c^k 


PI 


j,k 


1 
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2E5y,<i«r)’’ 


1 I n,m\ 

1 + Ln=l g(^l,fc ) 

K^ + N 

1 1 —1 n.m 

1 + 


n.m \ 
1 ^t+l,k) 


K„ 


n,t=l 90 


■t.i 


(9) 


and A"* is a normalization constant that makes 
SEi <5™ = 1, e is a small constant error bound of the 
approximations. Note the ML estimators LE, C inside 
the summation are subject to the specific instantiation 
ofz^. 


In Q, d and (3 can be viewed as the estimated initial 
transition probabilities of the FHMM w.r.t. q. 


The FIG of the FHMM, denoted by FlCfhmmi is ob¬ 
tained by maximizing J{q,x^) w.r.t. g: 

FIG(a:'^, Af) = max{J'(g, a;^)}. (10) 

9 


Similar to FICmm and FlChmm, the use of FIC/ft,mm as 
the approximation of the observed data log-likelihood 
under a certain model is justihed: 

Theorem 2. FIC{x^, A4) is asymptotically equiva¬ 
lent to logp(a;'^|Af). 

The proof is analogous to that of FIGmm and 
PlC/imm [21 [6], and omitted here. 


5 FAB for FHMMs 

5.1 FAB’s Lower Bound of FIC 

Since W, C depends on the specific instantiation of 
z^, in order to compute FlCfhmm exactly, we need to 
evaluate W, C for each instantiation of z’^, which is 
infeasible. So we bound FIG(a;'^,M) from below by 
setting all VF, C to the same values, and get a relaxed 
lower bound Q: 

Q{q,x^,W,C) < J{q,x^) < \ogp{x^\M). 

FABfhmm is an optimization procedure that tries to 
maximize the above lower bound of FlCfhmm- 

M*,qpW*,C*,= argma-x g{q,x^,W,C). (11) 

M,q,W,C 


( 8 ) 












5.2 FAB Variational EM Algorithm 


Let us first fix the structural parameters of the FHMM 
model, i.e. M, Ai, • • • , Km- Then our objective is to 
optimize 0 w.r.t. {q,W,C). In this phase FAB is 
a typical variational EM Algorithm, iterating between 
E-steps and M-steps. We denote the f-th iteration with 
the superscript {f}. 


In (|^, the auxiliary distribution q is required to be 
close to q. So during the FAB optimization, we always 
set q to be the variational distribution in the previous 
iteration, i.e. q^''^ — . 


In the f-th E-step, we fix TV, C = 

and q = The E-step computes the q which 


maximizes (11). 


The exact E-step for FHMMs is intractable [8], and 
thus in E-step, we adopt the Mean-field Variational 
Inference proposed in [5]. 


5.2.1 FAB E-Step 

Mean-Field Variational Inference The Mean- 
Field Variational Inference approximates the FHMM 
with M uncoupled HMMs, by introducing a varia¬ 
tional parameter for each layer variable 

g^ppj.Qj5-iinates the contribution of to the 

corresponding observation x^. 

The structured variational distribution q is defined as 

N,M T„ 

=§ n i^)!! Kfr’. fc)}, 

n,m—l t—2 

( 12 ) 

where Z is the normalization constant, and 


constructs a matrix whose diagonal is v, and all off- 
diagonal elements are 0; is the residual: 


M 


- n,(m) 


= - E 

l^m 


n.(/)) 


(15) 


In (14), h depends on q, but q also depends on h in 
the Forward-Backward routine, in a complicated way. 
Such an interdependence makes the exact solution dif¬ 
ficult to find. Therefore in (15), we use to get 

approximate solutions of h. 


Note the bias incorporates the effects of the 

shrinkage factor j™. Through the bias, the shrinkage 
factors will induce the Hidden State Shrinkage to be 
discussed later. 


Forward-Backward Routine After updating the 
variational parameters h, we use the Forward- 
Backward algorithm to compute the sufficient statis¬ 


tics q( 2 :”’^*^), and qiztL^l^h 




The following recurrence relations for the forward 
quantities and backward quantities is de¬ 

rived from the definition (T2][T3) of q: 


pn,{m) 

J t,k 
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n,(m) 

t,k 




h 
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n,(m) »(m) 
l,k “fe 
n,{m) K„ 
t,k ^j — 
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pn,{m) A(m) 


if 


if t>l 


1 


c 

1 


n.fm) 
t + 1 


Zj=l 


Mm) ,n,{m) ,n,{m) 


if t<T„ 


if t=T„ 


(16) 


where is the normalization constant that makes 

E Km _ -| 

Jt,k ~ 




Krr 


(m) 


= n fit’ 

(13) 






Note is a 1 X Km vector, which gives a bias for 

each of the Km settings of . 

We obtain a system of equations for and 

^(^n,(m))^ which minimize KL{q\\p{z^\x^,0)), by 
setting its derivation w.r.t. to 0: 

=diag{5™}exp{TV™'C-i&”’('”^ 

(14) 

where A™ is the vector consisting of the diagonal el¬ 
ements of TV™ C“^TV™; diag(u) is an operator that 


The sufficient statistics are computed thereafter: 


9C 


/ n,(r 


n,(m) 


nAm) 

\k 

n,(m) 

t,k 


) = fi 
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n,(m) 7 n,(m) 

t,fc b k 1 


(17) 


i,(m) 


n(m) j nAiiT') 1 (tn 

ft-\J P),k Kk b) 


t,k 


(18) 


5.2.2 M-Step 

In the M-step, we fix q = q = q^*^, and maximize 0 
W.r.t. obtaining their optimal values: 


N,Tr, 


M 


C = mdiag{^ E («'-E 

/ n,f—1 m—1 


C N,Tri \ /N,Tri 

E <<?(^r') E 

n,t—^ / \n,f=l 


(19) 













where: 1) In the equation of W, f is the Moore- 
Penrose pseudo-inverse, W = j 

is an {J2m=i X 1 vector of the concatenation 
of 2 :”’*-^^ • • • 2) In the equation of C, the 

mdiag(X) operator sets all the off-diagonal elements 
of matrix X to 0, leaving the diagonal ones intact. 


5.3 Automatic Hidden State Shrinkage 


The shrinkage factors 5^ induce the hidden state 
shrinkage effect, which is a central mechanism to 
FABfhmm's model selection ability. Previous FAB 
methods [71 lain] report similar effects. 


In the beginning, we initialize the model with a large 
enough number M of layers, and all layers with a large 
enough number Amax of hidden state values. During 
the EM iterations, suppose there are two similar hid¬ 
den states i,j in the m-th layer, and state i is less 
favored than state j, i.e. q{z^^) < Ac¬ 

cording to r < ST- In the next E-step, the vari¬ 
ational parameters of the i-th state are more 

down-weighted than those of the j-th state 
according to (14). Erom the Eorward-Backward up¬ 
date equations (16) one can see decreases 

with h[T'^. Therefore { 9 ( 2 ,-™^)} decreases more than 

{q(zj"^^)}, which in turn causes smaller transition 
probabilities into state i in the next M-step. Subse¬ 
quently, smaller transition probabilities into state i 
make ^^d 5^ even smaller compared to 

^^cl SJF, respectively. This process repeats 
and reinforces itself, making state i less and less impor¬ 
tant. When < L, a pre-specified threshold, 

then FABfhmm regards state i as redundant and re¬ 
moves it. If all but one states of a layer are removed, 
then we know this layer is redundant and FABfhmm 
removes this layer. When EAB fhmm converges, we ob¬ 
tain a more parsimonious model with the number of 
layers M', the number of hidden states in each layers 
Kl 

and the variational distribution q* 


, • • • , the optimal model parameters W*, C* 


This intuition is formulated in the following theorem. 

Theorem 3. Suppose we have one sufficiently 
long training sequence of observations Xj- = 
Xi,--- ,xt,T 3> I. In the m-th layer, two states 
i,j are initialized with proportional initial and in¬ 
wards transition probabilities, i.e., = po&j, and 

yk,f3j. ^ = pofij^ pp < 1, and identical outwards tran¬ 
sition probabilities Then: 1) af¬ 

ter the first EM iteration, the two states have iden¬ 
tical weight vectors and outwards probabilities: Wffi = 

~ I - r 

IFj”,Vfc,/3j j. = and proportional inwards proba¬ 


bilities: d' = pi&'j, = Pi^k,j> Sut Pi < pq. 

This trend preserves in all the following iterations. 2) 
When the probabilities of state j stablize, this shrink¬ 
age process on state i is accelerating: after the n-th 
iteration, the probability ratio p^ satisfy: Pn!Pn-i < 
Pn—llPn—2- 


Proof. For notation similicity, we drop the superscript 
n for the sequence number. 


1) For initialization all variational parameters h are set 
to I. From the Forward-Backward recurrence relations 


(|16|), it is easily seen that = pof, 


Jm) Am) 


= b. 


(m) 


Thus from ([T7|,([T^, the sufficient statistics of the 
ational probability q satisfy, for Vt: 


t,3 ' 
vari- 




In the M-step, when updating W, one relevant equa¬ 
tion is 


E ^tqizTy=E diag{g(^r)} 

t lAm t t 

( 21 ) 


We single out the i-th and j-th columns of both sides 
of pTl): 



By plugging (20) into (22), we find W™ = W" 


In the next E-step, using (14), we can identify 


, n,(m) 
"■t.i 
, n,(m) 


5T D D 


which is constant for any t. We denote it as Aq. Since 
po < 1) obviously Aq < 1. 


In the Eorward-Backward routine, by examining equa¬ 
tions (fl^, one can see = PqAq/IJ^ , b^T^ = 

the sufficient 


Let Pi = poAq. Again from li7l),(@, 
statistics of the new variational distribution g' satisfy, 
for Vt: 


f/ (Tn)-. n (m)\ 

9 Ki ) =P19 ). 

t Jjn) 

^(™) '1 _ „ \ 

\^t,k ~ PW 





















When we update a and /3, since the training sequence 
is sufficiently long (T 3> 1), the pseudocount “1” 
and in their update equations can be ignored. 

Therefore one can easily obtain 

a'^ = pid', 

Obviously pi < po- That is, the new ratio of the ini¬ 
tial/inwards probabilities becomes smaller, while the 
outwards probabilities remain unchanged. 

The same argument holds for all subsequent iterations. 

2) In the n-th iteration, let the ratio in ( [2^ be denoted 
as Xn- From the above arguments, pn/pn-i = A„, thus 
it is sufficient to prove A„ < A„_i. 

If state j “survives” the shrinkage, then after many it¬ 
erations its probabilities would become stablized, i.e., 
is almost the same between the n-th and 
(n-l)-th iterations. In this condition, X„ decreases 
with pn, together with p„ < Pn-i, one can conclude 
An ^ An—1. n 

Theorem 5 sheds light on FABj/imm’s shrinkage mech¬ 
anism: given two similar components, if they are not 
initialized evenly (which is almost always the case due 
to randomness), then they will “compete” for probabil¬ 
ity mass, until one dominates the other. Another im¬ 
portant observation is, the M-step “relays” the shrink¬ 
age effect between consecutive E-steps, and thus run¬ 
ning multiple E-steps in a row would not gain much 
speed-up as observed on Latent Feature Models [5]. 

5.4 Parameter Identifiability 

Although HMMs/FHMMs are identifiable in the level 
of equivalence classes [13] , generally they are not iden¬ 
tifiable if two states have very similar outwards tran¬ 
sition probabilities and emission distributions. For¬ 
mally, if states i,j satisfy: Vfc,/3j ^, « 
p{x\z = i) Ki p(^x\z = j) , then we can alter the in¬ 
wards transition probabilities of i,j without changing 
the probability law of this model, as long as the sum 
of their inwards transition probabilities is fixed, i.e., 
/3. j -I- j = /3. j -I- f3. j. This forms an equivalence 
class with infinitely many parameters settings (one ex¬ 
treme is state i is absorbed into j), all of which fit the 
observed data equally well. Thus traditional point es¬ 
timation methods are unable to compare these param¬ 
eter settings and pick better ones. 

In contrast. Theorem 5 reveals that, FABfhmm will 
almost surely shrink and remove one of such two states, 
picking a parsimonious parameter setting (e.g. the one 
with i absorbed into j) among this equivalence class. 


6 Synthetic Experiments 

To test our inference algorithm, we ran experi¬ 
ments on a synthetic data set. The performance of 
FAB fhmm (denoted as “RFAB”, i.e. Refined FAB) and 
three major competitors, i.e., conventional FAB with¬ 
out marginalization (denoted as “FAB”), variational 
Bayesian FHMM (denoted as “VB”), and iFHMM, 
was compared. 

6.1 Experimental Settings 

We constructed a ground truth model FHMM, which 
has 3 layers, with state numbers (2, 2,3). These states 
comprise a total state space of 2 • 2 • 3 = 12 combi¬ 
nations.The observations are 3-dimensional, and the 
covariance matrix E = | °6 o°4 o V The three mean 

Vo 0 0.4/ 

matrices {W™}, initial probabilities {n"*} and tran¬ 
sition matrices {/S’”} were randomly generated, and 
omitted here. 

Two sequences of length T = 2000 were randomly gen¬ 
erated. One sequence was used for training, and the 
other was used for testing. 

RFAB, FAB and VB-FHMM are all initialized with 3 
HMMs, and 10 states in each HMM. Compared to the 
true model, this setting has a lot of redundant states. 
For VB-FHMM, we used the “component death” idea 
in [Tj, i.e. if a hidden state receives too little prob¬ 
ability mass, then it will be removed. This scheme 
is similar to FAB’s pruning scheme, despite the fact 
that the inference algorithm does not incorporate the 
shrinkage regularization. 

We obtained the Matlab code of iFHMM from Van 
Gael. We used its default hyperparameters, i.e.: a ^ 
F(1 -I- AT, 1 -I- Ht), where K is the number of hid¬ 
den states, and Ht is the T-th harmonic number; 

~ F(1, 1). In order to see how iFHMM per¬ 
forms the model selection on different hidden state 
spaces, we initialized iFHMM with hidden state num¬ 
bers AT € {4,7,10,15}. 

The maximal iterations of all models other than 
iFHMM are set to 1000. iFHMM is set to have 500 
iterations for burn-in and 500 iterations for sampling. 
Each algorithm was tested for 10 trials. 

6.2 Results and Discussions 

Three indices of the models were collected: 1) 
the eventually obtained state numbers; 2) the log- 
likelihood of the training sequence; and 3) the pre¬ 
dictive log-likelihood of the test sequence. 

The state number in the three HMMs are first sorted 
from smallest to largest, and then averaged. The av- 


Ki K2 K3 


REAB 

1.9 (0.4) 

3.0 (0.6) 

4.8 (1.1) 

FAB 

1.4 (0.6) 

2.2 (0.7) 

4.0 (1.5) 

VB 

3.2 (1.5) 

4.7 (1.2) 

6.3 (1.8) 

iFHMM/4 

4.4 (0.3) 

/ 

/ 

iFHMM/7 

7.3 (0.7) 

/ 

/ 

iFHMM/10 

9.5 (1.2) 

/ 

/ 

iFHMM/15 

13.6 (1.3) 

/ 

/ 


Table 1: Average State Numbers 


Training Testing 


RFAB 

-10937 (254) 

-12416 (425) 

FAB 

-11848 (322) 

-13647 (598) 

VB 

-11239 (336) 

-14430 (662) 

iFHMM/4 

-12256 (293) 

-14376 (329) 

iFHMM/7 

-13471 (274) 

-14687 (385) 

iFHMM/10 

-13949 (214) 

-15116 (302) 

iFHMM/15 

-15514 (201) 

-16895 (342) 


Table 2: Training and Testing Data Log-likelihoods 


erage values, and their standard deviations (in paren¬ 
theses) are reported in the following table: 

From Table 1, we can see that FAB obtained the most 
parsimonious learned models, but often it removes too 
many states, such that one HMM has only one state 
left. In contrast, RFAB is more “conservative”, as it 
keeps more states. This difference is probably because 
after marginalizing a and /3, the pseudocount “1” ap¬ 
pears in the update equations of a and /3 , which 
smooths their estimated values. VB-FHMM removed 
some states, but there are still a few redundant states 
left. It indicates that without the shrinkage regular¬ 
ization, the model selection ability is limited. iFHMM 
almost always stays around its initial number of hid¬ 
den states, no matter how many states they are initial¬ 
ized with. This indicates the model selection ability of 
iFHMM is weak. 

The training and testing data log-likelihoods are com¬ 
pared in Table 2. RFAB achieves the best log- 
likelihoods on both sequences, followed by FAB. VB- 
FHMM achieves better log-likelihood than FAB on 
the training sequence, but worse on the test se¬ 
quence, which suggests that it overfits the training 
data. iFHMM’s performance is significantly inferior 
in general, and degrades quickly as the initial K in¬ 
creases. 

Fig.2 shows how the hidden state numbers K change 
over iterations in a typical trial. The final state num¬ 
bers at convergence are shown at the end of each 
line. RFAB and FAB converge much faster than other 



Figure 2: State Number Evolves over Iterations 



Figure 3: Training Data Log-likelihood 


methods (< 200 iterations), which shows the shrink¬ 
age regularization accelerates the pruning of redun¬ 
dant states. 

Fig.3 shows how training data log-likelihood changes 
over iterations in the same trial as in Fig.2. The log- 
likelihoods of RFAB and FAB on the training data 
are quite close. The log-likelihoods of iFHMM are not 
shown in this figure, as they are well below other meth¬ 
ods (around -13500). 

7 Conclusions and Future Work 

The FHMM is a flexible and expressive model, but it 
also poses a difficult challenge on how to set its free, 
structural parameters. In this paper, we have success¬ 
fully extended the recently-developed FAB framework 
onto the FHMM to address its model selection prob¬ 
lem. We have derived a better asymptotic approxi¬ 
mation of the data marginal likelihood than conven¬ 
tional FAB, by integrating out the initial and transi¬ 
tion probabilities. Based on this refined marginal like¬ 
lihood, an EM-like iterative optimization procedure. 










































namely FABfhmm, has been developed, which can find 
both good model structures and model parameters at 
the same time. Experiments on a synthetic data set 
have shown that FAB fhmm obtains more parsimonious 
and better-fit models than the state-of-the-art non- 
parametric iFHMM and variational FHMM. 

In addition, we have proved that, during the 
FAB//jmm’s shrinkage process on hidden variables, if 
there are two very similar hidden states, then one state 
would almost surely “die out” and be pruned. This 
proof theoretically consolidates the shrinkage process 
which we observe in experiments. 
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A Appendix 


Lemma 1 . Suppose {zn,i,--- ,Zn,Tr,}n=i ^ 
quences of Bernoulli random variables, whose means 

are{pnp,‘ ■ ■ ,Pn,T„}^=i- For the n-th sequence {zn,t}, 

,Zn,T„ o.re independent with each other. Let 
Vn = Vn = £’[?/„] = Y.iPn,i- Suppose further 

that ^ oo as T„ —^ oo. Besides, there are N num¬ 
bers {yn}, yn,yn « yn- When all T„ are large enough, 
the following bounds hold: 

1 ) 


E[log(y„ + 1)] = log(y„ + 1) + + ei; 

2/n + 1 


2 ) 

E[yn logyri] = yn logy„ + ivn - yn) + £ 2 - 
Here £ 1,62 are small bounded errors. 

Proof. 


E 


1 


2/n + 1, 


r-l T„ 


= / Y\.iqn,i + Pn,i ■ t)dt. (27) 


We apply the Inequality of arithmetic and geometric 
means to nr=l(9n. + Pn,i ■ t): 


W > 0 : n(g„, + pn,i ■ t) < 


i=l 


Tn 


T„ 


n 


Let Pn denote ^, and denote 1 — . Then (27) 

becomes 


E 


1 


2/n + 1. 


/- - \T 1 1 ~ 9n"~''^ 1 

< / {qn+Pnt)^^dt = < —. 

Jo \Tn H" l)Pn Vn 

(28) 


Plugging (28) into (25), we have 


1) Using the convexity of — log(?/„ + l), we easily obtain 


E[log(j/„ + 1)] < log(E[ 2 /„] + 1) = log(j/„ + 1). (24) 

We proceed to derive a lower bound of E[log(?/„ + 1)]. 

The following lower bound of the logarithm function 
is well know: 


log a; > 1-for all a; > 0. 

X 


Substituting x with , we have 

2/n + 1 

log(2/n + 1) - log(yn + !)>!- 


2/n + 1 
2/n + 1 


E[log(y„ + 1)] - \og{yn + !)>!- ^ 

2/n yn 


Thus — A is the lower bound of the approximation 
error, which tends to 0 as T„ —>■ 00 . Hence 


3£ia > 0, s.t. VT„, E[log(y„ + 1)] > log(y„ + 1) - £ia- 

(29) 


Combining (24) and (29), we have 


|E[log(j/„ + 1)] - log(y„ + 1)| < £ia- (30) 

On the other hand, as yn ~ 2/n, the first order approx¬ 
imation of log(yn + 1) about ijn is good: 


After taking the expectation of both sides, it becomes 

E[log(y„-kl)]-log(y„-kl) > l-(yn + l)E —. 

VVn + IJ 
(25) 


E 


Vn-ei 


is commonly referred to as the negative mo¬ 


ment oi yn [3]- We apply the corollary in [5] that 


log(y„ + 1) = log(y„ + 1) + + £ 16 , (31) 

yn + 1 


where ei6 is an error of order o(y„^) that tends to 0 
when Tn increases. 


Combining (30) and (31), we have 


E 


= [ Gn{t)dt, 

Jo 


(26) 


where Gn(t) is the probability generating function of 
yn. It is known that G'„(t) = Y{iLi{qn,i + Pn,i ■ t), in 
which (jn.i = 1 — Pn,i- Thus (p^ becomes 


E[log(yn + 1)] - log(yn + 1) 


yn - yn 
yn + 1 


< |£la| + |£l6|- 


In other words, log(yn -|- I) + A approximates 
E[log(yn + l)] with a small bounded error. In addition. 




































this error tends to 0 quickly as the sequence length Tn 
increases. 

2) It is easy to verify log is convex, and thus 

E[ 2 /„ log yn] > Vn log Vn- (32) 

We proceed to derive an upper bound of E[j/„ log 

Applying the inequality log cc < a: — 1 by substituting 
X with 1^, we have 

Vn ^ ,yn y’’ 


, un ^ / yn n yn 

yn log — < yn{- -1) = V- yn- 

Vn Vn Vn 


(33) 


Taking the expectation of both sides of (33), we have 


< 


E[y„log?/„] - yn logy„ 

Vn 

Var(?/„) 


Vn 


Vn 


Pn,i 

Pn,i 


< 1 . 


(34) 


Combining (32) and (34), we have 


|E[?/„logj/„] - ?/„log?/„|<l. 


(35) 


Corollary 2. {yn}, {yn} and {ijn} are defined as the 
same as in Lemma 1. Then the following bounds hold: 


1 ) 


2 ) 


E[iogr(j/„)] 

=y„(logy„ - —) -{yn + l. logy„) 

2y„ 2 

+ -(log27r + l) + e 3 ; 


E ^logr(j/„)-logr(^z/„) 


=i: 


i»s(^) + 


hogiY^Vn) - log 


+ -{N - l)(log27r + 1) + £ 4 . 
Here £ 3,64 are small bounded errors. 


Proof. 

1) The Stirling’s approximation for the log-Gamma 
function is: 

logr(y„) = ( 2 /„-^)log?/„- 2 /„ + ^log27r + £ 3 a, (38) 


Furthermore, when ?/„ « the first order approxi¬ 
mation of log j/„ about i/n is good, i.e.: 


2 /n(log 2 /„ - logy„) 
=yn{yn - yn)/yn + £ 2 a 


I- - X , {yn-TJnf , 

= {yn - yn) H-7-1 £ 20 , 


yn 


where £20 is a small bounded error. 


(36) 


As r„ —00 , both yn and ijn tend to 00 . Therefore 


{Vn-Vr,) 

fin 

comes 


is a small bounded error e 2 b- Then (36) be- 


where e^a is an error term of 0{y~^), which goes to 
zero quickly as T„ increases. Therefore E[£ 3 a], denoted 
as £ 30 , is also a small bounded number. 


Taking the expectation of both sides of (38), and ap¬ 
plying Lemma 1, we obtain 

E[iogr(j/„)] 

=yn{\ogyn - - [ijn + l,^ogyn) + i(log27r-f 1) -b£3. 

2yn 2 2 


2) Obtained by repeatedly applying 1), and combining 
terms involving 


2/nlog?/„ - 2/nlogyn = (j/n “ Ijn) + £2a + £26- (37) 


Combining (35) and (37), we have 


E[y„ log?/„] = yn logijn + (yn - Vn) + £ 2 , 


where |£ 2 | < |£ 2 a| + |£ 2 f)| + 1 is also a small bounded 


error. 




















